Comparing Machine Learning algorithms

In our paper, we chose the Decision Tree classifier provided by the scikit-learn package to test our Provenance Network Analytics approach on three applications (see the Overview for links to those). It implements an optimised version of the CART algorithm and runs quickly compared with many other classification algorithms.

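In this notebook, we explore other classification algorithms also available in the scikit-learn package and compare their performance with Decision Tree classifiers in our three applications. The algorithms tested against Decision Tree classifiers are: SVC, KNeighborsClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, GaussianNB, QuadraticDiscriminantAnalysis, SGDClassifier, and MLPClassifier.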

Since our main purpose is to gain a rough understanding of the relative benefit (accuracy) vs cost (time) of the above algorithms against Decision Tree classifiers, optimising the parameters of those algorithms is outside the scope of our experiments. We use the default settings specified by scikit-learn, which are usually sensible.


In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from analytics import balance_smote

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

Loading and preparing data


In [2]:
datasets = []  # list of the datasets to be tested with the above classifiers

ProvStore Documents

See Application 1 - ProvStore Documents.ipynb for more details.


In [3]:
df = pd.read_csv("provstore/data.csv")
datasets.append(('ProvStore', df))

CollabMap Datasets

There are three datasets from CollabMap: buildings, routes, and route sets. See Application 2 - CollabMap Data Quality.ipynb for more details.


In [4]:
df = pd.read_csv("collabmap/depgraphs.csv", index_col='id')
trust_threshold = 0.75
df['label'] = df.apply(lambda row: 'Trusted' if row.trust_value >= trust_threshold else 'Uncertain', axis=1)
df.drop('trust_value', axis=1, inplace=True)

datasets.append(('CollabMap/Buildings', df.filter(like="Building", axis=0)))
datasets.append(('CollabMap/Routes', df.filter(regex=r"^Route\d", axis=0)))  # raw string: \d is a regex escape, not a string escape
datasets.append(('CollabMap/Routesets', df.filter(like="RouteSet", axis=0)))
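
The reason the routes are selected with an anchored regular expression rather than a substring match can be seen on a toy frame. This is a minimal sketch; the row labels below are hypothetical and only mimic the three CollabMap prefixes, they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical row labels, chosen only to mimic the three CollabMap prefixes.
toy = pd.DataFrame(
    {"nodes": [3, 5, 7, 9]},
    index=["Building1.0", "Route2.0", "RouteSet3.0", "Route41.1"],
)

# A substring match on "Route" also picks up the RouteSet rows ...
by_substring = toy.filter(like="Route", axis=0).index.tolist()

# ... whereas anchoring the pattern and requiring a digit keeps routes only.
by_regex = toy.filter(regex=r"^Route\d", axis=0).index.tolist()

print(by_substring)
print(by_regex)
```

With `axis=0`, `DataFrame.filter` matches against the index labels, so no column values are inspected here.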

Radiation Response Game Dataset

See Application 3 - RRG Messages.ipynb for more details.


In [5]:
filepath = lambda k: "rrg/depgraphs-%d.csv" % k
label = lambda l: 'other' if l != 'instruction' else l
df = pd.read_csv(filepath(11), index_col=0)
df.label = df.label.apply(label).astype('category')
datasets.append(('RRG/k=11', df))

Testing and collecting measurements

First, we define the Timer class (below, taken from this recipe) to measure the computing time of training and classification. Note that this class allows us to disable garbage collection during each measurement for more consistent results.


In [6]:
import gc
import timeit

class Timer:
    def __init__(self, timer=None, disable_gc=False, verbose=True):
        if timer is None:
            timer = timeit.default_timer
        self.timer = timer
        self.disable_gc = disable_gc
        self.gc_state = None
        self.verbose = verbose
        self.start = self.end = self.interval = None
        
    def __enter__(self):
        if self.disable_gc:
            self.gc_state = gc.isenabled()
            gc.disable()
        self.start = self.timer()
        return self

    def __exit__(self, *args):
        self.end = self.timer()
        if self.disable_gc and self.gc_state:
            gc.enable()
            self.gc_state = None            
        self.interval = self.end - self.start
        if self.verbose:
            print('time taken: %f seconds' % self.interval)
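
The same idea (pause garbage collection, time a block, restore the previous state) can also be sketched with `contextlib`. This is an illustrative variant, not the recipe's code; the name `timed` and the result dictionary are assumptions made for this example.

```python
import gc
import timeit
from contextlib import contextmanager

@contextmanager
def timed(disable_gc=True):
    """Yield a dict whose 'interval' key is filled in when the block exits."""
    result = {"interval": None}
    gc_was_enabled = gc.isenabled()
    if disable_gc:
        gc.disable()
    start = timeit.default_timer()
    try:
        yield result
    finally:
        # Record the elapsed time first, then restore the collector's state.
        result["interval"] = timeit.default_timer() - start
        if disable_gc and gc_was_enabled:
            gc.enable()

with timed() as t:
    total = sum(i * i for i in range(100_000))

print("time taken: %f seconds" % t["interval"])
```

As in the Timer class, garbage collection is only re-enabled if it was enabled before the block started, so nested or repeated measurements do not change the interpreter's settings.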

Next, we list the classifiers to be trained and tested in our experiment in classifier_classes.


In [7]:
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier

classifier_classes = [
    DecisionTreeClassifier,
    SVC,
    KNeighborsClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    GaussianNB,
    QuadraticDiscriminantAnalysis,
    SGDClassifier,
    MLPClassifier
]

For each dataset, we balance the data using the SMOTE method (as described here). We then carry out 10-fold cross-validation tests, splitting the balanced dataset into a training set and a test set in each of 10 iterations. In each iteration, every classifier from the classifier_classes list above is trained and tested with the same training set and test set. The results are returned in a DataFrame whose rows tell us the dataset and classifier used along with its accuracy and computing time.


In [8]:
columns = ['dataset', 'classifier', 'accuracy', 'time']
def run_test_on(dataset_name, df):
    df = balance_smote(df)
    X = df.drop('label', axis=1)
    Y = df.label
    results = []
    
    skf = StratifiedKFold(n_splits=10, shuffle=True)
    for train, test in skf.split(X, Y):
        for clf_cls in classifier_classes:
            classifier_name = clf_cls.__name__
            timer = Timer(disable_gc=True, verbose=False)
            with timer:
                clf = clf_cls()
                clf.fit(X.iloc[train], Y.iloc[train])
                accuracy_score = clf.score(X.iloc[test], Y.iloc[test])
            results.append((dataset_name, classifier_name, accuracy_score, timer.interval))
            print(results[-1])
            
    performance = pd.DataFrame(results, columns=columns)
    performance.classifier = performance.classifier.astype('category')
    return performance
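
A DataFrame of this shape can be summarised per classifier with a group-by. The sketch below is only an illustration: it uses four rows copied from the first two ProvStore folds in the output further down, and the variable names are chosen for this example.

```python
import pandas as pd

columns = ["dataset", "classifier", "accuracy", "time"]
# Two folds for two classifiers, copied from the ProvStore output further down.
rows = [
    ("ProvStore", "DecisionTreeClassifier", 0.98433862433862429, 0.6282739690504968),
    ("ProvStore", "SVC", 0.97724867724867726, 54.664983707945794),
    ("ProvStore", "DecisionTreeClassifier", 0.98317460317460315, 0.6422885661013424),
    ("ProvStore", "SVC", 0.97523809523809524, 56.62782682082616),
]
performance = pd.DataFrame(rows, columns=columns)

# Mean accuracy and mean time (in seconds) per classifier across the folds.
summary = performance.groupby("classifier")[["accuracy", "time"]].mean()
print(summary)
```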

Since we want to compare the other classifiers against Decision Tree classifiers, we use the performance of the Decision Tree classifiers as the baseline. The normalisation function below divides the accuracy and computing time of all classifiers by the mean accuracy and mean computing time of Decision Tree classifiers, respectively.


In [9]:
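def normalise_accuracy_time(performance, baseline_classifier="DecisionTreeClassifier"):
    # Average only the numeric columns; recent pandas versions raise an error on the string columns
    baseline_accuracy, baseline_time = performance.loc[
        performance.classifier == baseline_classifier, ['accuracy', 'time']
    ].mean()
    # Normalise in place: after this, the baseline classifier averages 1.0 on both measures
    performance.accuracy = performance.accuracy / baseline_accuracy
    performance.time = performance.time / baseline_time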
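
As a worked example of this normalisation, consider made-up measurements in which the baseline averages 0.80 accuracy and 2.0 seconds. The sketch below is a stand-alone variant that returns a copy instead of modifying its argument; the function name `normalise_against` and all numbers are assumptions for illustration.

```python
import pandas as pd

def normalise_against(performance, baseline="DecisionTreeClassifier"):
    """Return a copy with accuracy and time divided by the baseline's means."""
    is_baseline = performance["classifier"] == baseline
    base = performance.loc[is_baseline, ["accuracy", "time"]].mean()
    out = performance.copy()
    out["accuracy"] = out["accuracy"] / base["accuracy"]
    out["time"] = out["time"] / base["time"]
    return out

# Made-up measurements: the baseline averages 0.80 accuracy and 2.0 seconds.
toy = pd.DataFrame({
    "classifier": ["DecisionTreeClassifier", "DecisionTreeClassifier", "SVC"],
    "accuracy": [0.75, 0.85, 0.40],
    "time": [1.0, 3.0, 20.0],
})
normalised = normalise_against(toy)
print(normalised)
```

After normalisation, the toy SVC row reads as roughly 0.5 for accuracy (half as accurate as the baseline on average) and 10.0 for time (ten times slower), which is how the normalised values in this notebook are meant to be read.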

Experiment: Using the two functions above, the code below iterates over the 5 datasets loaded in the datasets list in the previous section. For each dataset, it runs the performance test, normalises the results, and appends them to the performance DataFrame.

Caution: Running the code below may take about 50 minutes to finish.

Note on warnings below: A few data points are not suitable for certain algorithms (e.g. GradientBoostingClassifier, QuadraticDiscriminantAnalysis) and generate warnings. For our purpose of comparing the accuracy and computation cost of the selected algorithms, they can be safely ignored.
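
If the warnings are too noisy, one option is to silence them only around the call that produces them, using the standard `warnings` module. This is a minimal sketch: `run_quietly` and `noisy_ratio` are hypothetical helpers written for this example and stand in for a library call that warns on degenerate input.

```python
import warnings

def run_quietly(fn, *args, **kwargs):
    """Call fn with RuntimeWarning suppressed, restoring the filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return fn(*args, **kwargs)

def noisy_ratio(numerator, denominator):
    # Stand-in for a library call that warns on degenerate input.
    if denominator == 0:
        warnings.warn("divide by zero encountered", RuntimeWarning)
        return float("inf")
    return numerator / denominator

print(run_quietly(noisy_ratio, 1.0, 0.0))
```

Because `catch_warnings` restores the previous filters on exit, warnings raised elsewhere in the notebook are unaffected.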


In [10]:
all_results = []
for dataset_name, df in datasets:
    results = run_test_on(dataset_name, df)
    normalise_accuracy_time(results)
    all_results.append(results)
# DataFrame.append was removed in pandas 2.0, so concatenate the per-dataset results once
performance = pd.concat(all_results, ignore_index=True)


Original data shapes: (13870, 22) (13870,)
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/imblearn/base.py:306: UserWarning: The target type should be binary.
  warnings.warn('The target type should be binary.')
Balanced data shapes: (94430, 22) (94430,)
('ProvStore', 'DecisionTreeClassifier', 0.98433862433862429, 0.6282739690504968)
('ProvStore', 'SVC', 0.97724867724867726, 54.664983707945794)
('ProvStore', 'KNeighborsClassifier', 0.97798941798941796, 0.6841244460083544)
('ProvStore', 'RandomForestClassifier', 0.98529100529100533, 0.9375908700749278)
('ProvStore', 'AdaBoostClassifier', 0.24137566137566138, 11.694342364091426)
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/ensemble/gradient_boosting.py:583: RuntimeWarning: overflow encountered in double_scalars
  tree.value[leaf, 0, 0] = numerator / denominator
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/utils/extmath.py:410: RuntimeWarning: invalid value encountered in subtract
  out = np.log(np.sum(np.exp(arr - vmax), axis=0))
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/ensemble/gradient_boosting.py:558: RuntimeWarning: invalid value encountered in multiply
  return np.sum(-1 * sample_weight * (Y * pred).sum(axis=1) +
('ProvStore', 'GradientBoostingClassifier', 0.96719576719576716, 177.97722859308124)
('ProvStore', 'GaussianNB', 0.63534391534391532, 0.1893952637910843)
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:719: RuntimeWarning: divide by zero encountered in power
  X2 = np.dot(Xm, R * (S ** (-0.5)))
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:719: RuntimeWarning: invalid value encountered in multiply
  X2 = np.dot(Xm, R * (S ** (-0.5)))
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:722: RuntimeWarning: divide by zero encountered in log
  u = np.asarray([np.sum(np.log(s)) for s in self.scalings_])
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.2196116358973086)
('ProvStore', 'SGDClassifier', 0.58592592592592596, 0.95608109398745)
('ProvStore', 'MLPClassifier', 0.97873015873015878, 7.766927886987105)
('ProvStore', 'DecisionTreeClassifier', 0.98317460317460315, 0.6422885661013424)
('ProvStore', 'SVC', 0.97523809523809524, 56.62782682082616)
('ProvStore', 'KNeighborsClassifier', 0.98232804232804238, 0.7215720429085195)
('ProvStore', 'RandomForestClassifier', 0.9838095238095238, 0.9315545791760087)
('ProvStore', 'AdaBoostClassifier', 0.23883597883597885, 12.325766609050333)
('ProvStore', 'GradientBoostingClassifier', 0.98126984126984129, 205.08686965494417)
('ProvStore', 'GaussianNB', 0.63195767195767194, 0.19474898418411613)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.21984002296812832)
('ProvStore', 'SGDClassifier', 0.63460317460317461, 1.0515488469973207)
('ProvStore', 'MLPClassifier', 0.97894179894179889, 17.490266302833334)
('ProvStore', 'DecisionTreeClassifier', 0.98042328042328042, 0.7004924649372697)
('ProvStore', 'SVC', 0.97492063492063497, 60.00700272107497)
('ProvStore', 'KNeighborsClassifier', 0.97492063492063497, 0.7562006588559598)
('ProvStore', 'RandomForestClassifier', 0.98137566137566135, 1.017202251125127)
('ProvStore', 'AdaBoostClassifier', 0.24116402116402116, 12.618408649927005)
('ProvStore', 'GradientBoostingClassifier', 0.98052910052910058, 191.60630124108866)
('ProvStore', 'GaussianNB', 0.63100529100529101, 0.18379174708388746)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.20788549608550966)
('ProvStore', 'SGDClassifier', 0.56105820105820103, 0.9839116588700563)
('ProvStore', 'MLPClassifier', 0.97915343915343911, 13.864243179094046)
('ProvStore', 'DecisionTreeClassifier', 0.982010582010582, 0.6382025182247162)
('ProvStore', 'SVC', 0.97587301587301589, 54.57399194291793)
('ProvStore', 'KNeighborsClassifier', 0.982010582010582, 0.7025791951455176)
('ProvStore', 'RandomForestClassifier', 0.98370370370370375, 0.9280164430383593)
('ProvStore', 'AdaBoostClassifier', 0.24359788359788359, 12.009571711998433)
('ProvStore', 'GradientBoostingClassifier', 0.98243386243386244, 187.4014346669428)
('ProvStore', 'GaussianNB', 0.64074074074074072, 0.1926153169479221)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.20157398888841271)
('ProvStore', 'SGDClassifier', 0.66264550264550259, 0.9724252661690116)
('ProvStore', 'MLPClassifier', 0.97978835978835976, 24.709092132980004)
('ProvStore', 'DecisionTreeClassifier', 0.98116402116402113, 0.6377822048962116)
('ProvStore', 'SVC', 0.97576719576719573, 55.80485440604389)
('ProvStore', 'KNeighborsClassifier', 0.97798941798941796, 0.7254003388807178)
('ProvStore', 'RandomForestClassifier', 0.98222222222222222, 0.9230234220158309)
('ProvStore', 'AdaBoostClassifier', 0.23904761904761904, 12.039744910085574)
('ProvStore', 'GradientBoostingClassifier', 0.98126984126984129, 189.90324166906066)
('ProvStore', 'GaussianNB', 0.63417989417989418, 0.1964319630060345)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.20328653207980096)
('ProvStore', 'SGDClassifier', 0.39915343915343915, 0.9719647150486708)
('ProvStore', 'MLPClassifier', 0.97746031746031747, 11.966222131159157)
('ProvStore', 'DecisionTreeClassifier', 0.97838066977532856, 0.6494953420478851)
('ProvStore', 'SVC', 0.97329376854599403, 55.16984712891281)
('ProvStore', 'KNeighborsClassifier', 0.9358838490885969, 0.7060083199758083)
('ProvStore', 'RandomForestClassifier', 0.97965239508266211, 0.9225535469595343)
('ProvStore', 'AdaBoostClassifier', 0.24258160237388723, 12.090174629818648)
('ProvStore', 'GradientBoostingClassifier', 0.97700296735905046, 190.71470193611458)
('ProvStore', 'GaussianNB', 0.63342518016108518, 0.19647060008719563)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.20877749798819423)
('ProvStore', 'SGDClassifier', 0.40525646460364562, 0.9651722251437604)
('ProvStore', 'MLPClassifier', 0.97806273844849512, 28.429929216858)
('ProvStore', 'DecisionTreeClassifier', 0.98346757100466298, 0.6399862139951438)
('ProvStore', 'SVC', 0.97657905892327257, 54.96170974592678)
('ProvStore', 'KNeighborsClassifier', 0.97912250953793978, 0.7141891859937459)
('ProvStore', 'RandomForestClassifier', 0.98389147944044086, 0.918852160917595)
('ProvStore', 'AdaBoostClassifier', 0.24459516744383214, 12.048601221991703)
('ProvStore', 'GradientBoostingClassifier', 0.98325561678677409, 177.18023935798556)
('ProvStore', 'GaussianNB', 0.6300339126748622, 0.18278415803797543)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.1987031849566847)
('ProvStore', 'SGDClassifier', 0.62484103433658333, 0.9174390151165426)
('ProvStore', 'MLPClassifier', 0.98071216617210677, 17.275759185198694)
('ProvStore', 'DecisionTreeClassifier', 0.97997032640949555, 0.6224692738614976)
('ProvStore', 'SVC', 0.97286986011021614, 53.502078492892906)
('ProvStore', 'KNeighborsClassifier', 0.97456549385332769, 0.6719589701388031)
('ProvStore', 'RandomForestClassifier', 0.98050021195421788, 0.8765174869913608)
('ProvStore', 'AdaBoostClassifier', 0.23749470114455279, 11.357924479059875)
('ProvStore', 'GradientBoostingClassifier', 0.95803306485799067, 171.1285808600951)
('ProvStore', 'GaussianNB', 0.63183552352691819, 0.1877018720842898)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.1983064040541649)
('ProvStore', 'SGDClassifier', 0.14444679949130987, 0.9076564291026443)
('ProvStore', 'MLPClassifier', 0.97742687579482834, 18.265281066996977)
('ProvStore', 'DecisionTreeClassifier', 0.98050021195421788, 0.6138428659178317)
('ProvStore', 'SVC', 0.97424756252649425, 52.9012228010688)
('ProvStore', 'KNeighborsClassifier', 0.97965239508266211, 0.6679396738763899)
('ProvStore', 'RandomForestClassifier', 0.98134802882577366, 0.8896840901579708)
('ProvStore', 'AdaBoostClassifier', 0.20199236964815601, 11.323103261878714)
('ProvStore', 'GradientBoostingClassifier', 0.97986434930055111, 171.47252991888672)
('ProvStore', 'GaussianNB', 0.63543874523103006, 0.18165320483967662)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.19786123000085354)
('ProvStore', 'SGDClassifier', 0.54069520983467567, 0.9000884960405529)
('ProvStore', 'MLPClassifier', 0.97816871555743956, 22.837120848009363)
('ProvStore', 'DecisionTreeClassifier', 0.98081814328105132, 0.6138826361857355)
('ProvStore', 'SVC', 0.97371767698177192, 53.480530600994825)
('ProvStore', 'KNeighborsClassifier', 0.97615515048749468, 0.6645809879992157)
('ProvStore', 'RandomForestClassifier', 0.98166596015260699, 0.8876681609544903)
('ProvStore', 'AdaBoostClassifier', 0.24003815175922, 11.361000906908885)
('ProvStore', 'GradientBoostingClassifier', 0.97816871555743956, 171.13530291896313)
('ProvStore', 'GaussianNB', 0.63702840186519716, 0.18148943991400301)
('ProvStore', 'QuadraticDiscriminantAnalysis', 0.071428571428571425, 0.19680562405847013)
('ProvStore', 'SGDClassifier', 0.58488766426451888, 0.8985274231527001)
('ProvStore', 'MLPClassifier', 0.9773208986858839, 15.01476671709679)
Original data shapes: (5175, 22) (5175,)
Balanced data shapes: (8982, 22) (8982,)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.88777777777777778, 0.016987378941848874)
('CollabMap/Buildings', 'SVC', 0.88777777777777778, 0.654363403795287)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.88555555555555554, 0.05372064607217908)
('CollabMap/Buildings', 'RandomForestClassifier', 0.88777777777777778, 0.034585230983793736)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.88666666666666671, 0.5658830350730568)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.88666666666666671, 0.3947491920553148)
('CollabMap/Buildings', 'GaussianNB', 0.81444444444444442, 0.012314618099480867)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.018667222931981087)
('CollabMap/Buildings', 'SGDClassifier', 0.86111111111111116, 0.008864692877978086)
('CollabMap/Buildings', 'MLPClassifier', 0.87666666666666671, 0.6421780500095338)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.90423162583518935, 0.01484856684692204)
('CollabMap/Buildings', 'SVC', 0.90645879732739421, 0.6500226748175919)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.9031180400890868, 0.053719501942396164)
('CollabMap/Buildings', 'RandomForestClassifier', 0.90757238307349664, 0.03554195910692215)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.90645879732739421, 0.5695940731093287)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.90979955456570161, 0.3946095979772508)
('CollabMap/Buildings', 'GaussianNB', 0.81514476614699327, 0.012178122997283936)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.016865471145138144)
('CollabMap/Buildings', 'SGDClassifier', 0.89755011135857465, 0.009447709890082479)
('CollabMap/Buildings', 'MLPClassifier', 0.85968819599109136, 0.6124980931635946)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.90423162583518935, 0.015430834144353867)
('CollabMap/Buildings', 'SVC', 0.9031180400890868, 0.6543858039658517)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.90200445434298437, 0.05414189305156469)
('CollabMap/Buildings', 'RandomForestClassifier', 0.90534521158129178, 0.03505380009301007)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.9031180400890868, 0.5720462128520012)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.90423162583518935, 0.39682423789054155)
('CollabMap/Buildings', 'GaussianNB', 0.83518930957683746, 0.012489131884649396)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.01576193398796022)
('CollabMap/Buildings', 'SGDClassifier', 0.89977728285077951, 0.008987564826384187)
('CollabMap/Buildings', 'MLPClassifier', 0.88641425389755013, 0.5565136370714754)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.88752783964365256, 0.01517017581500113)
('CollabMap/Buildings', 'SVC', 0.88864142538975499, 0.6576030489522964)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.88864142538975499, 0.05321069899946451)
('CollabMap/Buildings', 'RandomForestClassifier', 0.88864142538975499, 0.034273005090653896)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.88975501113585742, 0.5650485181249678)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.88975501113585742, 0.3973813729826361)
('CollabMap/Buildings', 'GaussianNB', 0.80178173719376389, 0.011975385947152972)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.015171959064900875)
('CollabMap/Buildings', 'SGDClassifier', 0.79955456570155903, 0.008932056836783886)
('CollabMap/Buildings', 'MLPClassifier', 0.88084632516703787, 0.6642728180158883)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.90089086859688194, 0.015501237008720636)
('CollabMap/Buildings', 'SVC', 0.89643652561247211, 0.6675954160746187)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.89420935412026725, 0.05456989607773721)
('CollabMap/Buildings', 'RandomForestClassifier', 0.90089086859688194, 0.03459863387979567)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.89866369710467708, 0.5672015990130603)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.90089086859688194, 0.4064055010676384)
('CollabMap/Buildings', 'GaussianNB', 0.7984409799554566, 0.011983428848907351)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.015305576846003532)
('CollabMap/Buildings', 'SGDClassifier', 0.5, 0.009027238003909588)
('CollabMap/Buildings', 'MLPClassifier', 0.88975501113585742, 0.8404242810793221)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.9164810690423163, 0.015215085120871663)
('CollabMap/Buildings', 'SVC', 0.91202672605790647, 0.6739825340919197)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.91202672605790647, 0.05468761106021702)
('CollabMap/Buildings', 'RandomForestClassifier', 0.9164810690423163, 0.03474461194127798)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.9164810690423163, 0.5646350081078708)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.9164810690423163, 0.3944222799036652)
('CollabMap/Buildings', 'GaussianNB', 0.80957683741648112, 0.01192634692415595)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.015321757178753614)
('CollabMap/Buildings', 'SGDClassifier', 0.91091314031180404, 0.008777210023254156)
('CollabMap/Buildings', 'MLPClassifier', 0.9031180400890868, 0.5597613728605211)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.90423162583518935, 0.015363594982773066)
('CollabMap/Buildings', 'SVC', 0.89977728285077951, 0.6706824710126966)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.89977728285077951, 0.055158216040581465)
('CollabMap/Buildings', 'RandomForestClassifier', 0.9031180400890868, 0.03474282193928957)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.90423162583518935, 0.5641248559113592)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.9031180400890868, 0.40329694002866745)
('CollabMap/Buildings', 'GaussianNB', 0.8229398663697105, 0.012779850978404284)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.01747719501145184)
('CollabMap/Buildings', 'SGDClassifier', 0.89643652561247211, 0.010938325896859169)
('CollabMap/Buildings', 'MLPClassifier', 0.89643652561247211, 0.4319399781525135)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.91425389755011133, 0.01533547998405993)
('CollabMap/Buildings', 'SVC', 0.91202672605790647, 0.674257131991908)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.91202672605790647, 0.05335550894960761)
('CollabMap/Buildings', 'RandomForestClassifier', 0.9164810690423163, 0.03643997898325324)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.91202672605790647, 0.5689118970185518)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.91425389755011133, 0.39761419710703194)
('CollabMap/Buildings', 'GaussianNB', 0.80846325167037858, 0.01226760190911591)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.015477162087336183)
('CollabMap/Buildings', 'SGDClassifier', 0.86748329621380849, 0.009108009049668908)
('CollabMap/Buildings', 'MLPClassifier', 0.90423162583518935, 0.35561410686932504)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.90979955456570161, 0.01639224193058908)
('CollabMap/Buildings', 'SVC', 0.90868596881959907, 0.6748741089832038)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.90868596881959907, 0.0544117649551481)
('CollabMap/Buildings', 'RandomForestClassifier', 0.90868596881959907, 0.03523489483632147)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.90757238307349664, 0.5674329781904817)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.90868596881959907, 0.3978273479733616)
('CollabMap/Buildings', 'GaussianNB', 0.80066815144766146, 0.011915349867194891)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.015337583143264055)
('CollabMap/Buildings', 'SGDClassifier', 0.90423162583518935, 0.008761785924434662)
('CollabMap/Buildings', 'MLPClassifier', 0.89866369710467708, 0.5861516550648957)
('CollabMap/Buildings', 'DecisionTreeClassifier', 0.90868596881959907, 0.015345130814239383)
('CollabMap/Buildings', 'SVC', 0.91091314031180404, 0.6742275769356638)
('CollabMap/Buildings', 'KNeighborsClassifier', 0.90979955456570161, 0.05475896107964218)
('CollabMap/Buildings', 'RandomForestClassifier', 0.9131403118040089, 0.03464117995463312)
('CollabMap/Buildings', 'AdaBoostClassifier', 0.91091314031180404, 0.565354464109987)
('CollabMap/Buildings', 'GradientBoostingClassifier', 0.9131403118040089, 0.39780942001380026)
('CollabMap/Buildings', 'GaussianNB', 0.8229398663697105, 0.012065473012626171)
('CollabMap/Buildings', 'QuadraticDiscriminantAnalysis', 0.5, 0.015248610870912671)
('CollabMap/Buildings', 'SGDClassifier', 0.5, 0.008899723878130317)
('CollabMap/Buildings', 'MLPClassifier', 0.90200445434298437, 0.43090497702360153)
Original data shapes: (4997, 22) (4997,)
Balanced data shapes: (7816, 22) (7816,)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.97186700767263423, 0.020979539956897497)
('CollabMap/Routes', 'SVC', 0.95524296675191811, 0.44960711104795337)
('CollabMap/Routes', 'KNeighborsClassifier', 0.97186700767263423, 0.04359858902171254)
('CollabMap/Routes', 'RandomForestClassifier', 0.97058823529411764, 0.0408935840241611)
('CollabMap/Routes', 'AdaBoostClassifier', 0.95524296675191811, 0.5249527900014073)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.96035805626598469, 0.3884893278591335)
('CollabMap/Routes', 'GaussianNB', 0.88363171355498726, 0.010736207943409681)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.5, 0.013698538998141885)
('CollabMap/Routes', 'SGDClassifier', 0.86700767263427114, 0.008162094047293067)
('CollabMap/Routes', 'MLPClassifier', 0.89002557544757033, 0.35570379719138145)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.95524296675191811, 0.021168424980714917)
('CollabMap/Routes', 'SVC', 0.95652173913043481, 0.43363479105755687)
('CollabMap/Routes', 'KNeighborsClassifier', 0.94629156010230175, 0.04332349495962262)
('CollabMap/Routes', 'RandomForestClassifier', 0.96803069053708435, 0.04138321802020073)
('CollabMap/Routes', 'AdaBoostClassifier', 0.94757033248081846, 0.5245488719083369)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.9578005115089514, 0.38526863185688853)
('CollabMap/Routes', 'GaussianNB', 0.87340153452685421, 0.010610869154334068)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.50383631713554988, 0.013361826073378325)
('CollabMap/Routes', 'SGDClassifier', 0.84015345268542196, 0.008047211915254593)
('CollabMap/Routes', 'MLPClassifier', 0.92199488491048598, 0.6466054818592966)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.979539641943734, 0.021141943987458944)
('CollabMap/Routes', 'SVC', 0.96547314578005117, 0.45452029700390995)
('CollabMap/Routes', 'KNeighborsClassifier', 0.95652173913043481, 0.04406040208414197)
('CollabMap/Routes', 'RandomForestClassifier', 0.97570332480818411, 0.04175944998860359)
('CollabMap/Routes', 'AdaBoostClassifier', 0.96163682864450128, 0.5209596098866314)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.96930946291560105, 0.3918162831105292)
('CollabMap/Routes', 'GaussianNB', 0.88618925831202044, 0.010599188972264528)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.5, 0.013487095013260841)
('CollabMap/Routes', 'SGDClassifier', 0.88874680306905374, 0.008146509062498808)
('CollabMap/Routes', 'MLPClassifier', 0.91048593350383633, 0.5307393870316446)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.9578005115089514, 0.021546280942857265)
('CollabMap/Routes', 'SVC', 0.95652173913043481, 0.45405340497381985)
('CollabMap/Routes', 'KNeighborsClassifier', 0.96035805626598469, 0.0447120419703424)
('CollabMap/Routes', 'RandomForestClassifier', 0.97698209718670082, 0.04188215802423656)
('CollabMap/Routes', 'AdaBoostClassifier', 0.9578005115089514, 0.525707176188007)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.96930946291560105, 0.39176974119618535)
('CollabMap/Routes', 'GaussianNB', 0.88363171355498726, 0.010745926992967725)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.50127877237851659, 0.013412278145551682)
('CollabMap/Routes', 'SGDClassifier', 0.86700767263427114, 0.008297739084810019)
('CollabMap/Routes', 'MLPClassifier', 0.9156010230179028, 0.5349720451049507)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.97058823529411764, 0.021594892954453826)
('CollabMap/Routes', 'SVC', 0.95907928388746799, 0.4563536951318383)
('CollabMap/Routes', 'KNeighborsClassifier', 0.95396419437340152, 0.043562589911744)
('CollabMap/Routes', 'RandomForestClassifier', 0.97186700767263423, 0.04198571410961449)
('CollabMap/Routes', 'AdaBoostClassifier', 0.95268542199488493, 0.523284254828468)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.96419437340153458, 0.3914957919623703)
('CollabMap/Routes', 'GaussianNB', 0.87851662404092068, 0.010639494052156806)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.5, 0.013480408117175102)
('CollabMap/Routes', 'SGDClassifier', 0.89258312020460362, 0.00814447202719748)
('CollabMap/Routes', 'MLPClassifier', 0.92199488491048598, 0.6447383579798043)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.96675191815856776, 0.021754767978563905)
('CollabMap/Routes', 'SVC', 0.96675191815856776, 0.46761976298876107)
('CollabMap/Routes', 'KNeighborsClassifier', 0.9578005115089514, 0.044905869057402015)
('CollabMap/Routes', 'RandomForestClassifier', 0.98337595907928388, 0.04145476594567299)
('CollabMap/Routes', 'AdaBoostClassifier', 0.95524296675191811, 0.5204397179186344)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.97058823529411764, 0.39297017781063914)
('CollabMap/Routes', 'GaussianNB', 0.8964194373401535, 0.011470217956230044)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.5, 0.013653832953423262)
('CollabMap/Routes', 'SGDClassifier', 0.88874680306905374, 0.009221930988132954)
('CollabMap/Routes', 'MLPClassifier', 0.93222506393861893, 0.7108156781177968)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.97698209718670082, 0.022520752158015966)
('CollabMap/Routes', 'SVC', 0.95140664961636834, 0.45409965910948813)
('CollabMap/Routes', 'KNeighborsClassifier', 0.95012787723785164, 0.04357666801661253)
('CollabMap/Routes', 'RandomForestClassifier', 0.97698209718670082, 0.04129388392902911)
('CollabMap/Routes', 'AdaBoostClassifier', 0.96035805626598469, 0.5216042711399496)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.96803069053708435, 0.3901908420957625)
('CollabMap/Routes', 'GaussianNB', 0.87212276214833762, 0.010740266181528568)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.5, 0.013356228126212955)
('CollabMap/Routes', 'SGDClassifier', 0.76086956521739135, 0.008141199825331569)
('CollabMap/Routes', 'MLPClassifier', 0.89130434782608692, 0.6191923918668181)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.96291560102301788, 0.021467959973961115)
('CollabMap/Routes', 'SVC', 0.94757033248081846, 0.44973524613305926)
('CollabMap/Routes', 'KNeighborsClassifier', 0.95140664961636834, 0.044122966937720776)
('CollabMap/Routes', 'RandomForestClassifier', 0.96675191815856776, 0.04133962909691036)
('CollabMap/Routes', 'AdaBoostClassifier', 0.94501278772378516, 0.5265291421674192)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.9578005115089514, 0.3895133410114795)
('CollabMap/Routes', 'GaussianNB', 0.87723785166240409, 0.010711267124861479)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.50127877237851659, 0.013615763979032636)
('CollabMap/Routes', 'SGDClassifier', 0.86956521739130432, 0.008239500923082232)
('CollabMap/Routes', 'MLPClassifier', 0.91943734015345269, 0.6894009760580957)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.95641025641025645, 0.020868079969659448)
('CollabMap/Routes', 'SVC', 0.95641025641025645, 0.4549294929020107)
('CollabMap/Routes', 'KNeighborsClassifier', 0.95897435897435901, 0.043662518030032516)
('CollabMap/Routes', 'RandomForestClassifier', 0.96923076923076923, 0.04213361884467304)
('CollabMap/Routes', 'AdaBoostClassifier', 0.95512820512820518, 0.5255626440048218)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.95897435897435901, 0.3892016881145537)
('CollabMap/Routes', 'GaussianNB', 0.87564102564102564, 0.010490303160622716)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.52692307692307694, 0.013371005887165666)
('CollabMap/Routes', 'SGDClassifier', 0.87435897435897436, 0.007904492085799575)
('CollabMap/Routes', 'MLPClassifier', 0.93333333333333335, 0.8296690708957613)
('CollabMap/Routes', 'DecisionTreeClassifier', 0.97692307692307689, 0.021202150965109468)
('CollabMap/Routes', 'SVC', 0.96025641025641029, 0.4666906089987606)
('CollabMap/Routes', 'KNeighborsClassifier', 0.96153846153846156, 0.04472280712798238)
('CollabMap/Routes', 'RandomForestClassifier', 0.982051282051282, 0.04239291697740555)
('CollabMap/Routes', 'AdaBoostClassifier', 0.95641025641025645, 0.5244158839341253)
('CollabMap/Routes', 'GradientBoostingClassifier', 0.96410256410256412, 0.3946032510139048)
('CollabMap/Routes', 'GaussianNB', 0.87820512820512819, 0.010561563074588776)
('CollabMap/Routes', 'QuadraticDiscriminantAnalysis', 0.5, 0.013449113117530942)
('CollabMap/Routes', 'SGDClassifier', 0.86282051282051286, 0.007984800031408668)
('CollabMap/Routes', 'MLPClassifier', 0.92307692307692313, 0.6299189948476851)
Original data shapes: (4710, 22) (4710,)
Balanced data shapes: (6038, 22) (6038,)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.95364238410596025, 0.022387400036677718)
('CollabMap/Routesets', 'SVC', 0.95033112582781454, 0.2934609961230308)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.94867549668874174, 0.01796406600624323)
('CollabMap/Routesets', 'RandomForestClassifier', 0.96357615894039739, 0.04233773797750473)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.94536423841059603, 0.40487608104012907)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.95364238410596025, 0.33734262199141085)
('CollabMap/Routesets', 'GaussianNB', 0.70529801324503316, 0.007951893145218492)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.5, 0.009763133013620973)
('CollabMap/Routesets', 'SGDClassifier', 0.80629139072847678, 0.006294980179518461)
('CollabMap/Routesets', 'MLPClassifier', 0.91225165562913912, 0.6456035568844527)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.95529801324503316, 0.022938868962228298)
('CollabMap/Routesets', 'SVC', 0.93543046357615889, 0.2891701660118997)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.9387417218543046, 0.01761500397697091)
('CollabMap/Routesets', 'RandomForestClassifier', 0.96192052980132448, 0.039801802951842546)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.9370860927152318, 0.40900831390172243)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.95364238410596025, 0.3370290780439973)
('CollabMap/Routesets', 'GaussianNB', 0.72350993377483441, 0.008051329059526324)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.5298013245033113, 0.009809479117393494)
('CollabMap/Routesets', 'SGDClassifier', 0.9056291390728477, 0.0062596299685537815)
('CollabMap/Routesets', 'MLPClassifier', 0.91059602649006621, 0.848628485109657)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.9668874172185431, 0.02195188100449741)
('CollabMap/Routesets', 'SVC', 0.95529801324503316, 0.29088515089824796)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.95364238410596025, 0.01778050302527845)
('CollabMap/Routesets', 'RandomForestClassifier', 0.9668874172185431, 0.04070160584524274)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.95695364238410596, 0.41058047697879374)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.9701986754966887, 0.3366008100565523)
('CollabMap/Routesets', 'GaussianNB', 0.71026490066225167, 0.00804082490503788)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.62086092715231789, 0.009874872863292694)
('CollabMap/Routesets', 'SGDClassifier', 0.80132450331125826, 0.00629096501506865)
('CollabMap/Routesets', 'MLPClassifier', 0.86258278145695366, 0.4127837170381099)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.94039735099337751, 0.022793053183704615)
('CollabMap/Routesets', 'SVC', 0.94867549668874174, 0.29284341912716627)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.94370860927152322, 0.017488260054960847)
('CollabMap/Routesets', 'RandomForestClassifier', 0.95033112582781454, 0.040734926937147975)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.9387417218543046, 0.40877815312705934)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.94370860927152322, 0.3369099551346153)
('CollabMap/Routesets', 'GaussianNB', 0.71357615894039739, 0.008098769001662731)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.69867549668874174, 0.009835758013650775)
('CollabMap/Routesets', 'SGDClassifier', 0.7483443708609272, 0.006303105968981981)
('CollabMap/Routesets', 'MLPClassifier', 0.91887417218543044, 0.5536460909061134)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.96026490066225167, 0.022474762052297592)
('CollabMap/Routesets', 'SVC', 0.95033112582781454, 0.2905781001318246)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.95529801324503316, 0.018090395024046302)
('CollabMap/Routesets', 'RandomForestClassifier', 0.9668874172185431, 0.040718986885622144)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.95529801324503316, 0.4052738829050213)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.95860927152317876, 0.3364135539159179)
('CollabMap/Routesets', 'GaussianNB', 0.73013245033112584, 0.008031239965930581)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.55960264900662249, 0.009823711821809411)
('CollabMap/Routesets', 'SGDClassifier', 0.51986754966887416, 0.006246552104130387)
('CollabMap/Routesets', 'MLPClassifier', 0.92384105960264906, 0.6654085628688335)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.95695364238410596, 0.02247065701521933)
('CollabMap/Routesets', 'SVC', 0.96192052980132448, 0.2903721600305289)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.95695364238410596, 0.017271067947149277)
('CollabMap/Routesets', 'RandomForestClassifier', 0.9701986754966887, 0.03956448216922581)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.95198675496688745, 0.4108170981053263)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.96026490066225167, 0.33951162500306964)
('CollabMap/Routesets', 'GaussianNB', 0.74006622516556286, 0.007972134975716472)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.5, 0.009750866796821356)
('CollabMap/Routesets', 'SGDClassifier', 0.80629139072847678, 0.006307743955403566)
('CollabMap/Routesets', 'MLPClassifier', 0.91887417218543044, 0.6905629721004516)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.96357615894039739, 0.02231321996077895)
('CollabMap/Routesets', 'SVC', 0.95198675496688745, 0.29159425594843924)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.95364238410596025, 0.01769202691502869)
('CollabMap/Routesets', 'RandomForestClassifier', 0.97185430463576161, 0.0407216539606452)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.95033112582781454, 0.40757188596762717)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.96192052980132448, 0.3426747820340097)
('CollabMap/Routesets', 'GaussianNB', 0.72682119205298013, 0.007955130888149142)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.61092715231788075, 0.00976023287512362)
('CollabMap/Routesets', 'SGDClassifier', 0.82284768211920534, 0.006251373095437884)
('CollabMap/Routesets', 'MLPClassifier', 0.88245033112582782, 0.4838518360629678)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.94867549668874174, 0.021686450811102986)
('CollabMap/Routesets', 'SVC', 0.94039735099337751, 0.28616546490229666)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.95198675496688745, 0.017545380163937807)
('CollabMap/Routesets', 'RandomForestClassifier', 0.96192052980132448, 0.03954092087224126)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.95198675496688745, 0.4065771880559623)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.9668874172185431, 0.34199658781290054)
('CollabMap/Routesets', 'GaussianNB', 0.72350993377483441, 0.00795102003030479)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.51821192052980136, 0.009783991845324636)
('CollabMap/Routesets', 'SGDClassifier', 0.82947019867549665, 0.006266012089326978)
('CollabMap/Routesets', 'MLPClassifier', 0.92218543046357615, 0.4708578719291836)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.96192052980132448, 0.02254656609147787)
('CollabMap/Routesets', 'SVC', 0.95033112582781454, 0.2978651749435812)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.95033112582781454, 0.017515859100967646)
('CollabMap/Routesets', 'RandomForestClassifier', 0.97682119205298013, 0.04210781003348529)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.94701986754966883, 0.4084370299242437)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.96192052980132448, 0.34344770293682814)
('CollabMap/Routesets', 'GaussianNB', 0.72847682119205293, 0.00797056290321052)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.67052980132450335, 0.009791651042178273)
('CollabMap/Routesets', 'SGDClassifier', 0.92715231788079466, 0.006243706913664937)
('CollabMap/Routesets', 'MLPClassifier', 0.92549668874172186, 0.5744108147919178)
('CollabMap/Routesets', 'DecisionTreeClassifier', 0.9700996677740864, 0.02246780996210873)
('CollabMap/Routesets', 'SVC', 0.95182724252491691, 0.29382482497021556)
('CollabMap/Routesets', 'KNeighborsClassifier', 0.95681063122923593, 0.017376876901835203)
('CollabMap/Routesets', 'RandomForestClassifier', 0.96677740863787376, 0.04088546405546367)
('CollabMap/Routesets', 'AdaBoostClassifier', 0.94850498338870437, 0.4093765649013221)
('CollabMap/Routesets', 'GradientBoostingClassifier', 0.96511627906976749, 0.33941993792541325)
('CollabMap/Routesets', 'GaussianNB', 0.74252491694352163, 0.007861454971134663)
('CollabMap/Routesets', 'QuadraticDiscriminantAnalysis', 0.71594684385382057, 0.009866333100944757)
('CollabMap/Routesets', 'SGDClassifier', 0.84053156146179397, 0.006350570125505328)
('CollabMap/Routesets', 'MLPClassifier', 0.9169435215946844, 0.8341562729328871)
Original data shapes: (69, 22) (69,)
Balanced data shapes: (74, 22) (74,)
('RRG/k=11', 'DecisionTreeClassifier', 1.0, 0.0014888178557157516)
('RRG/k=11', 'SVC', 0.75, 0.0013928350526839495)
('RRG/k=11', 'KNeighborsClassifier', 1.0, 0.001620562979951501)
('RRG/k=11', 'RandomForestClassifier', 1.0, 0.011498658917844296)
('RRG/k=11', 'AdaBoostClassifier', 1.0, 0.0583934560418129)
('RRG/k=11', 'GradientBoostingClassifier', 1.0, 0.03988169599324465)
('RRG/k=11', 'GaussianNB', 0.875, 0.0019737021066248417)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0019838321022689342)
('RRG/k=11', 'SGDClassifier', 0.5, 0.0014307887759059668)
('RRG/k=11', 'MLPClassifier', 0.5, 0.003501111175864935)
('RRG/k=11', 'DecisionTreeClassifier', 0.625, 0.001395893981680274)
('RRG/k=11', 'SVC', 0.625, 0.00147455302067101)
('RRG/k=11', 'KNeighborsClassifier', 0.75, 0.0016390657983720303)
('RRG/k=11', 'RandomForestClassifier', 0.75, 0.011266457848250866)
('RRG/k=11', 'AdaBoostClassifier', 0.75, 0.05869359220378101)
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:719: RuntimeWarning: divide by zero encountered in power
  X2 = np.dot(Xm, R * (S ** (-0.5)))
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:719: RuntimeWarning: invalid value encountered in multiply
  X2 = np.dot(Xm, R * (S ** (-0.5)))
/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:722: RuntimeWarning: divide by zero encountered in log
  u = np.asarray([np.sum(np.log(s)) for s in self.scalings_])
('RRG/k=11', 'GradientBoostingClassifier', 0.75, 0.041996494168415666)
('RRG/k=11', 'GaussianNB', 0.875, 0.001696938183158636)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0019595439080148935)
('RRG/k=11', 'SGDClassifier', 0.75, 0.0013615519274026155)
('RRG/k=11', 'MLPClassifier', 0.5, 0.006086983950808644)
('RRG/k=11', 'DecisionTreeClassifier', 1.0, 0.0012267990969121456)
('RRG/k=11', 'SVC', 1.0, 0.0015221829526126385)
('RRG/k=11', 'KNeighborsClassifier', 1.0, 0.001708680996671319)
('RRG/k=11', 'RandomForestClassifier', 0.875, 0.011418401962146163)
('RRG/k=11', 'AdaBoostClassifier', 1.0, 0.06060530710965395)
('RRG/k=11', 'GradientBoostingClassifier', 0.875, 0.040803274139761925)
('RRG/k=11', 'GaussianNB', 1.0, 0.001698334002867341)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.00198193802498281)
('RRG/k=11', 'SGDClassifier', 0.625, 0.0013500789646059275)
('RRG/k=11', 'MLPClassifier', 0.5, 0.0035098791122436523)
('RRG/k=11', 'DecisionTreeClassifier', 0.875, 0.0012260228395462036)
('RRG/k=11', 'SVC', 1.0, 0.0013764961622655392)
('RRG/k=11', 'KNeighborsClassifier', 1.0, 0.0016852649860084057)
('RRG/k=11', 'RandomForestClassifier', 1.0, 0.011407349957153201)
('RRG/k=11', 'AdaBoostClassifier', 1.0, 0.059438828844577074)
('RRG/k=11', 'GradientBoostingClassifier', 1.0, 0.04061459400691092)
('RRG/k=11', 'GaussianNB', 0.875, 0.0016240121331065893)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0018182999920099974)
('RRG/k=11', 'SGDClassifier', 0.75, 0.001461958047002554)
('RRG/k=11', 'MLPClassifier', 0.5, 0.003513803007081151)
('RRG/k=11', 'DecisionTreeClassifier', 0.875, 0.0012086979113519192)
('RRG/k=11', 'SVC', 0.75, 0.0013758628629148006)
('RRG/k=11', 'KNeighborsClassifier', 0.875, 0.0016258079558610916)
('RRG/k=11', 'RandomForestClassifier', 0.875, 0.011359950061887503)
('RRG/k=11', 'AdaBoostClassifier', 0.875, 0.059313332894816995)
('RRG/k=11', 'GradientBoostingClassifier', 0.875, 0.04020914598368108)
('RRG/k=11', 'GaussianNB', 0.375, 0.0016456039156764746)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0017497390508651733)
('RRG/k=11', 'SGDClassifier', 0.5, 0.0018051520455628633)
('RRG/k=11', 'MLPClassifier', 0.5, 0.006517163012176752)
('RRG/k=11', 'DecisionTreeClassifier', 0.625, 0.0013907940592616796)
('RRG/k=11', 'SVC', 0.625, 0.0014790929853916168)
('RRG/k=11', 'KNeighborsClassifier', 1.0, 0.0017442090902477503)
('RRG/k=11', 'RandomForestClassifier', 1.0, 0.01228414406068623)
('RRG/k=11', 'AdaBoostClassifier', 0.75, 0.061202650889754295)
('RRG/k=11', 'GradientBoostingClassifier', 0.5, 0.04074292699806392)
('RRG/k=11', 'GaussianNB', 1.0, 0.0016191198956221342)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0019401400350034237)
('RRG/k=11', 'SGDClassifier', 0.5, 0.001365691889077425)
('RRG/k=11', 'MLPClassifier', 0.75, 0.004001813009381294)
('RRG/k=11', 'DecisionTreeClassifier', 1.0, 0.0012713468167930841)
('RRG/k=11', 'SVC', 0.875, 0.001363782910630107)
('RRG/k=11', 'KNeighborsClassifier', 0.875, 0.0017453581094741821)
('RRG/k=11', 'RandomForestClassifier', 0.875, 0.011464271927252412)
('RRG/k=11', 'AdaBoostClassifier', 0.875, 0.05868331203237176)
('RRG/k=11', 'GradientBoostingClassifier', 0.875, 0.04217184684239328)
('RRG/k=11', 'GaussianNB', 0.75, 0.0016572889871895313)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0017651650123298168)
('RRG/k=11', 'SGDClassifier', 0.5, 0.001581565011292696)
('RRG/k=11', 'MLPClassifier', 0.5, 0.0035655731335282326)
('RRG/k=11', 'DecisionTreeClassifier', 0.66666666666666663, 0.0013633829075843096)
('RRG/k=11', 'SVC', 0.5, 0.001506275963038206)
('RRG/k=11', 'KNeighborsClassifier', 0.66666666666666663, 0.0017049508169293404)
('RRG/k=11', 'RandomForestClassifier', 0.66666666666666663, 0.011329544940963387)
('RRG/k=11', 'AdaBoostClassifier', 0.66666666666666663, 0.0598211910109967)
('RRG/k=11', 'GradientBoostingClassifier', 0.66666666666666663, 0.04210941190831363)
('RRG/k=11', 'GaussianNB', 0.66666666666666663, 0.0017534890212118626)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0018699930515140295)
('RRG/k=11', 'SGDClassifier', 0.5, 0.0013889288529753685)
('RRG/k=11', 'MLPClassifier', 0.5, 0.006494611036032438)
('RRG/k=11', 'DecisionTreeClassifier', 0.83333333333333337, 0.0012138860765844584)
('RRG/k=11', 'SVC', 0.83333333333333337, 0.0013793460093438625)
('RRG/k=11', 'KNeighborsClassifier', 0.83333333333333337, 0.0017623151652514935)
('RRG/k=11', 'RandomForestClassifier', 0.83333333333333337, 0.011491863988339901)
('RRG/k=11', 'AdaBoostClassifier', 0.83333333333333337, 0.05878868489526212)
('RRG/k=11', 'GradientBoostingClassifier', 0.83333333333333337, 0.041642566910013556)
('RRG/k=11', 'GaussianNB', 0.66666666666666663, 0.0016783999744802713)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0017852059099823236)
('RRG/k=11', 'SGDClassifier', 0.5, 0.001595557201653719)
('RRG/k=11', 'MLPClassifier', 0.5, 0.003532076021656394)
('RRG/k=11', 'DecisionTreeClassifier', 1.0, 0.0012313551269471645)
('RRG/k=11', 'SVC', 0.83333333333333337, 0.0013750740326941013)
('RRG/k=11', 'KNeighborsClassifier', 1.0, 0.0016042978968471289)
('RRG/k=11', 'RandomForestClassifier', 1.0, 0.011315866140648723)
('RRG/k=11', 'AdaBoostClassifier', 0.83333333333333337, 0.059280174784362316)
('RRG/k=11', 'GradientBoostingClassifier', 1.0, 0.043036176823079586)
('RRG/k=11', 'GaussianNB', 1.0, 0.0016102669760584831)
('RRG/k=11', 'QuadraticDiscriminantAnalysis', 0.5, 0.0018779360689222813)
('RRG/k=11', 'SGDClassifier', 0.5, 0.0015114808920770884)
('RRG/k=11', 'MLPClassifier', 0.5, 0.00437734485603869)

We save the results to a file so that the lengthy tests above do not need to be rerun next time:


In [11]:
performance.to_pickle('performance.pkl')

In [12]:
performance = pd.read_pickle('performance.pkl')
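
The save-then-reload step above can be folded into a single helper that only reruns the tests when no cached file exists. This is a minimal sketch, not part of the original notebook: the name `load_or_compute` and its `compute` callback are hypothetical.

```python
import os

import pandas as pd


def load_or_compute(path, compute):
    """Return the DataFrame cached at `path`; otherwise compute and cache it.

    `compute` is a hypothetical zero-argument callable that runs the
    (lengthy) tests and returns the resulting DataFrame.
    """
    if os.path.exists(path):
        # A cached result exists: skip the expensive computation.
        return pd.read_pickle(path)
    df = compute()
    df.to_pickle(path)
    return df
```

With such a helper, the two cells above would reduce to something like `performance = load_or_compute('performance.pkl', run_all_tests)`, where `run_all_tests` stands in for whatever function produces the measurements.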

Charting the relative performance

In this section, we plot the measurements (accuracy and time) collected above to compare the performance of the tested classifiers relative to that of Decision Tree classifiers.

Since we have normalised the measurements against those of Decision Tree classifiers, the measurements of Decision Tree classifiers average 1.0 (for both accuracy and time). Therefore, to simplify the charts, we remove them from the plotted data.
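
To illustrate what normalising against Decision Tree classifiers means, the sketch below divides each measurement by the mean Decision Tree measurement on the same dataset. It is only an illustration under assumed inputs: the `raw` table, its dataset names, and all its numbers are made up, and the actual normalisation in this notebook may have been computed differently.

```python
import pandas as pd

# Hypothetical raw measurements (made-up numbers, for illustration only)
raw = pd.DataFrame({
    'dataset': ['A', 'A', 'A', 'B', 'B', 'B'],
    'classifier': ['DecisionTreeClassifier', 'GaussianNB', 'RandomForestClassifier'] * 2,
    'accuracy': [0.80, 0.60, 0.84, 0.90, 0.45, 0.90],
    'time': [0.010, 0.002, 0.020, 0.020, 0.004, 0.050],
})

# Mean Decision Tree measurements per dataset serve as the baseline
baseline = (
    raw[raw.classifier == 'DecisionTreeClassifier']
    .groupby('dataset')[['accuracy', 'time']]
    .mean()
)

# Divide every measurement by the baseline of its own dataset
relative = raw.copy()
for col in ['accuracy', 'time']:
    relative[col] = raw[col] / raw['dataset'].map(baseline[col])
```

After this step, every Decision Tree row has a relative accuracy and time of 1.0, which is why those rows carry no information in the charts.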


In [13]:
# Dropping measurements of DecisionTreeClassifier
performance = performance[~(performance.classifier == 'DecisionTreeClassifier')]

In [14]:
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Comparing Accuracy Score


In [15]:
plot = sns.barplot(x='dataset', y='accuracy', hue='classifier', data=performance, errwidth=1, capsize=0.02)
plot.hlines(1, -0.5, 4.5, linestyle='--', linewidth=1)
plot.legend(loc='upper center', bbox_to_anchor=(0.5, 1.2), ncol=3)
plot.set_xlabel('Dataset')
plot.set_ylabel('Relative Accuracy (over 1.0 is better)')
plot.figure.set_size_inches(16, 9)


Results: The chart above shows that, on all five datasets, the other classifiers yield the same level of accuracy as Decision Tree classifiers (1.0) or worse. The only exception is Random Forest classifiers on the CollabMap/Routes and CollabMap/Routesets datasets, and even there the improvement in accuracy is under 1%, as shown below.

Some algorithms have better mean accuracy on the RRG/k=11 dataset, but the 95% confidence intervals are too broad for us to say conclusively that they are actually better than the decision tree classifier.
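
To make the confidence-interval argument concrete, the sketch below computes a 95% interval for a mean relative accuracy and checks whether it contains 1.0, the Decision Tree baseline. It uses a t-based interval from `scipy.stats`, which is a swapped-in approximation and not necessarily the method behind the error bars in the chart; `mean_confidence_interval` is a hypothetical helper and the sample values are made up.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) of a t-based confidence interval for the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = stats.sem(values)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(values) - 1)
    return mean, mean - half_width, mean + half_width


# Hypothetical relative accuracies from repeated runs (made-up numbers)
relative_accuracy = [1.0, 1.5, 0.5, 1.0, 2.0, 1.0, 0.5, 1.5]
mean, lower, upper = mean_confidence_interval(relative_accuracy)

# If the interval contains 1.0, we cannot claim a difference from the baseline
inconclusive = lower <= 1.0 <= upper
```

In this made-up sample the mean is above 1.0, yet the interval still spans the baseline, which is the situation described for the RRG/k=11 dataset.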


In [16]:
performance[
    (performance.classifier == 'RandomForestClassifier') & (performance.dataset == 'CollabMap/Routes')
].mean(numeric_only=True)  # average only the numeric columns (accuracy, time)


Out[16]:
accuracy    1.006878
time        1.944126
dtype: float64

In [17]:
performance[
    (performance.classifier == 'RandomForestClassifier') & (performance.dataset == 'CollabMap/Routesets')
].mean(numeric_only=True)  # average only the numeric columns (accuracy, time)


Out[17]:
accuracy    1.008296
time        1.817231
dtype: float64

Comparing Computing Time


In [18]:
plot = sns.barplot(x='dataset', y='time', hue='classifier', data=performance, errwidth=1, capsize=0.02)
plot.figure.set_size_inches(16, 9)
plot.hlines(1, -0.5, 4.5, linestyle='--', linewidth=1)
plot.legend(loc='upper center', bbox_to_anchor=(0.5, 1.2), ncol=3)
plot.set_xlabel('Dataset')
plot.set_ylabel('Relative Computing Time (over 1.0 is slower)')
plot.set_yscale('log')


Results: The algorithms that are faster than Decision Tree classifiers in some cases (GaussianNB, QuadraticDiscriminantAnalysis, and SGDClassifier) are those that performed significantly worse in our previous chart (across the five datasets). The slightly more accurate algorithm (RandomForestClassifier) takes nearly double the time of the Decision Tree classifier.
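
The accuracy/time trade-off described above can also be checked in tabular form. The sketch below averages the relative measurements per classifier and looks for any classifier that is both faster and at least as accurate as the Decision Tree baseline. It assumes a table with the same columns as `performance`; the `perf` data and its numbers are made up for illustration and are not our measurements.

```python
import pandas as pd

# Hypothetical relative measurements (made-up numbers), same columns as `performance`
perf = pd.DataFrame({
    'dataset': ['A', 'A', 'B', 'B', 'A', 'B'],
    'classifier': ['GaussianNB', 'RandomForestClassifier',
                   'GaussianNB', 'RandomForestClassifier',
                   'SGDClassifier', 'SGDClassifier'],
    'accuracy': [0.7, 1.01, 0.5, 1.0, 0.6, 0.8],
    'time': [0.2, 1.9, 0.4, 1.8, 0.5, 1.2],
})

# Mean relative accuracy and time per classifier across all datasets
summary = perf.groupby('classifier')[['accuracy', 'time']].mean()

# Values are relative to Decision Tree classifiers, so 1.0 is the baseline
summary['faster'] = summary.time < 1.0
summary['at_least_as_accurate'] = summary.accuracy >= 1.0

# Classifiers that beat the baseline on time without losing accuracy
candidates = summary[summary.faster & summary.at_least_as_accurate]
```

An empty `candidates` table means that no classifier improves on one dimension without paying for it on the other.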

Conclusion: Among the classification algorithms from the scikit-learn package that we tested, DecisionTreeClassifier is the best choice for our work, as it is fast and provides high accuracy on the datasets that we investigate.